This is my first try to analyse a tHPI dataset. Therefore, some of the decisions and consequently their results seen in this report might not be perfect. Though, I’m continuously working on this report and will improve it iteratively.
I start by importing a recent tHPI which shows the average price in USD for a standard double room, based on 25 of the most popular US cities. I also add the Latitude and Longitude of these 25 cities, which will be used for fancier visualizations. Here is the tHPI for top 6 cities of US.
## city January February March April May June July August September
## 1 New York 243 245 297 377 410 372 317 320 430
## 2 Los Angeles 197 196 203 219 219 227 252 245 220
## 3 Chicago 154 163 213 259 333 311 299 267 293
## 4 Dallas 148 143 154 158 150 148 145 147 157
## 5 Philadelphia 172 189 207 228 246 244 200 198 247
## 6 Houston 149 157 157 161 173 146 149 143 150
## October lat lon
## 1 430 40.71278 -74.00594
## 2 219 34.05223 -118.24368
## 3 307 41.83690 -87.68470
## 4 163 32.77670 -96.79700
## 5 241 39.95000 -75.16670
## 6 154 29.76040 -95.36980
I reshape the spreadsheet data into a four column data frame, with city, month, price, and date in first, second, third and forth columns, respectively. The first six rows of the new data frame are as following.
## city month price date
## 1 New York January 243 0015-01-01
## 2 Los Angeles January 197 0015-01-01
## 3 Chicago January 154 0015-01-01
## 4 Dallas January 148 0015-01-01
## 5 Philadelphia January 172 0015-01-01
## 6 Houston January 149 0015-01-01
A few minor data processing tasks will be done later during the EDA and modeling phases. Next, I’ll conduct Exploratory Data Analysis (EDA) for getting more insights into the dataset.
The following three plots, show the density of hotel prices.
According to the first plot, most of the hotel prices are around $150, however this price can go up to around $450. Based on the second plot, in January, most of the cities have the $150 price while in e.g., May the prices are much higher. The third plot, decomposes the price distribution of different cities. Which is hard to analyse as it is. Therefore, I continue with the following two boxplots.
Actually, the above two plots became my favorites. The former one, illustrates the high-season. The average price starts increasing in April and is maximum in July. Then drops in August.
The latter plot, immediately shows different type of cities. Boston and New York have very high prices, and their prices vary very much. While, Houston or Orlando are cheaper and much more stable.
The following heatmap illustrates how prices change across months and cities.
The lower right corner of the above heatmap, roughly, captures the cities and months which I see the highest prices. In this heatmap, I can also differentiate between cities with stable prices and cities with very varying prices.
Here I try to make categories for cities. First, I categorize each city as “Cheap Cities”, “Moderate Cities” and “Expensive Cities”. Second, I categorize them as “Stable Cities” and “Unstable Cities”. Before this categorization, let me extract the mean, standard deviation and range of prices from January to October, for each city. Following comes these statistics for the top 6 cities of US.
## city price.mean price.sd price.range price.mean.class
## 1 New York 344.1 70.345891 187 (284,359]
## 2 Los Angeles 219.7 18.541545 56 (210,284]
## 3 Chicago 259.9 62.939918 179 (210,284]
## 4 Dallas 151.3 6.429965 20 (135,210]
## 5 Philadelphia 217.2 27.336585 75 (210,284]
## 6 Houston 153.9 8.685237 30 (135,210]
## price.sd.class price.range.class
## 1 (44.5,82.9] (126,235]
## 2 (6.12,44.5] (17.8,126]
## 3 (44.5,82.9] (126,235]
## 4 (6.12,44.5] (17.8,126]
## 5 (6.12,44.5] (17.8,126]
## 6 (6.12,44.5] (17.8,126]
The change of prices in each category are shown in following plots.
Using all the above plots, my intuition is that fluctuation of prices are following certain patterns. I illustrate some patterns, namely up-down-up, up-up-down-up, up-up-up-up, up-up-up-down in the following four subplots.
Based on my observations in above plots, and considering that the traveling patterns follow the climate changes, school holidays (and other yearly and 6-monthly periodic events), I expect that a linear combination of some Sinus and Cosinus functions with periods of 12 months and 6 months can estimate the trend of prices.
It seems that every city is following a periodic trend. For this, I define these four predictors: \(sin(\frac{\pi t}{6}),cos(\frac{\pi t}{6}),sin(\frac{\pi t}{3}),cos(\frac{\pi t}{3})\) where \(t\) is the number of month ranging from 1 to 12 for January to December. Then, I use a linear model to predict the price of each month. See following two R functions.
get.predictors <- function(t) {
return(data.frame(t=t,
sin = sin(pi/6*t), # 12 month period
cos = cos(pi/6*t),
sin2 = sin(pi/3*t), # 6 month period
cos2 = cos(pi/3*t)
)
)
}
get.predictions <- function(price.trend,time.end) {
lmfit <- lm(y ~ sin+cos+sin2+cos2,data=price.trend)
prediction.time <- 1:time.end
return(price.prediction <- data.frame(t=prediction.time,y=predict(lmfit,get.predictors(prediction.time))))
}
To test this model I train one model for each city, based on the prices from January to September, and then compare its predictions for October prices with the real values.
After visually testing the predicted prices of October, I realized that for some of the cities the predictions are very promissing, while for some other cities predictions fail.
Some good predictions of October price are shown in following plots (the black dots are real values and the red dots and lines are my predictions).
For some other cities, the predictions don’t work that well. This can be because of unknown predictors that our model doesn’t take into account, or other reasons!
Overall, the predictions are not very bad; the trend of prices is usually followed by the model. Therefore, next, I use my model to predict the prices of November and December.
I use all available prices from January to October to train my model and then I make predictions for November and December prices. The predictions for all 25 US cities are illustrated in following (The black dots and lines show the real prices and red lines show the predictions).
Following comes my predictions for prices in next months. To verify these predictions, we can wait two more months and compare the results.
## city November December
## 1 New York 418 327
## 2 Los Angeles 201 195
## 3 Chicago 279 219
## 4 Dallas 161 153
## 5 Philadelphia 238 206
## 6 Houston 156 153
## 7 Washington D.C. 321 263
## 8 Miami 193 219
## 9 Atlanta 179 172
## 10 Boston 408 314
## 11 San Francisco 307 260
## 12 Detroit 169 167
## 13 Orlando 130 133
## 14 San Diego 149 144
## 15 Las Vegas 185 170
## 16 Phoenix 132 140
## 17 Indianapolis 127 117
## 18 Salt Lake City 141 134
## 19 Denver 197 186
## 20 New Orleans 219 209
## 21 St. Louis 153 137
## 22 Seattle 185 172
## 23 Minneapolis 232 198
## 24 San Antonio 107 108
## 25 San Jose 192 177
While training the models, for each city I get 4 model coefficients. I assume that these coefficients are the features for the cities. This allows me to use a K-means clustering algorithm to see if the coefficients can help me to make clusters for the cities.
A simple model with k=4 clusters have a goodness of 74.4%. The following figure illustrates the 25 cities on a map and each cluster is shown by a color.
The above map already captures some of the geographical and climate characteristics of cities. For instance, Seattle which indeed has a different climate from other cities is the single member of its own cluster. Besides, many cities in central and western parts of the US which have similar climates end up in the same cluster. These clusters should be studied further in the future.
In this technical report, I started by importing the tHPI data and applied some basic modifications and reshaping to data. Then I conducted EDA to get insights into this dataset. As a result of my EDA I realized that prices follow periodic patterns. Therefore, I decided to use \(Sinus\) and \(Cosinus\) functions to fit a model to data. First, I tested my model by leaving out the October prices from training data and then I visually compared my predictions with the real prices. Then, I trained my model based on all available prices and made predictions for November and December prices. Also, coefficients of my model helped me to summarize each city with four numeric features which I used to cluster the cities.
I emphasize that the analysis provided in this report is very basic, and the choices might be naive in some cases. To improve the results, more domain knowledge, more data and deeper analysis are required. Next section elaborates on some of the possible future improvements.
If I find more time, I’ll do the following.